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ABSTRACT 

Querying uncertain data sets (represented as probability 
distributions) presents many challenges due to the large 
amount of data involved and the difficulties comparing un- 
certainty between distributions. The Earth Mover's Dis- 
tance (EMD) has increasingly been employed to compare 
uncertain data due to its ability to effectively capture the 
differences between two distributions. Computing the EMD 
entails finding a solution to the transportation problem, 
which is computationally intensive. In this paper, we pro- 
pose a new lower bound to the EMD and an index structure 
to significantly improve the performance of EMD based K- 
nearest neighbor (K-NN) queries on uncertain databases. 

We propose a new lower bound to the EMD that approxi- 
mates the EMD on a projection vector. Each distribution is 
projected onto a vector and approximated by a normal dis- 
tribution, as well as an accompanying error term. We then 
represent each normal as a point in a Hough transformed 
space. We then use the concept of stochastic dominance to 
implement an efficient index structure in the transformed 
space. We show that our method significantly decreases K- 
NN query time on uncertain databases. The index structure 
also scales well with database cardinality. It is well suited 
for heterogeneous data sets, helping to keep EMD based 
queries tractable as uncertain data sets become larger and 
more complex. 

1. INTRODUCTION 

Uncertain data is becoming more prevalent as the data 
generation capabilities of many scientific tools increases. The 
accuracy of many computational methods can be improved 
when the data uncertainty is retained, as little information 
is lost from the data acquisition phase to the analysis phase. 

Uncertain data is frequently represented as probability 
distributions. Many traditional querying techniques suffer 
significant performance degradation when operating on dis- 
tributions, due to the complexity of determining distance 
between uncertain objects. To address the issue of distance. 

Permission to make digital or hard copies of all or part of this work for 
personal or classroom use is granted without fee provided that copies are 
not made or distributed for profit or commercial advantage and that copies 
bear this notice and the fuU citation on the first page. To copy otherwise, to 
republish, to post on servers or to redistribute to lists, requires prior specific 
permission and/or a fee. Articles from this volume were invited to present 
their results at The 38th International Conference on Very Large Data Bases, 
August 27th - 31st 2012, Istanbul, Turkey 
Proceedings of the VLDB Endowment, Vol. 5, No. 3 
Copyright 2011 VLDB Endowment 2150-8097/11/11... $ 10.00. 



Ambuj K. Singh 
University of California, Santa Barbara 
Santa Barbara, CA, USA 

ambuj@cs.ucsb.edu 



the Earth Mover's Distance (EMD) has increasingly been 
utilized to query uncertain databases due to its ability to 
accurately retrieve similar distributions. 

The EMD is a metric for computing the distance between 
two discrete probability distributions. The intuition be- 
hind the EMD is that it computes the minimum amount 
of "work" or "flow" required to transform one distribution 
into another. This property has made the EMD popular 
in recent years, finding use in image retrieval [15], cluster 
comparison [23] and shape matching [6]. 

The EMD is computationally expensive to derive, as the 
theoretical time complexity of the EMD is exponential in 
the number of distribution bins. Although empirically the 
complexity of the EMD is usually cubic in the number of 
bins [15], this high computation cost is still a significant 
bottleneck. 

Nearest neighbor queries using the EMD require that the 
exact EMD between distributions must be computed, yet 
doing so on an entire database is prohibitive. Pruning ob- 
jects based on EMD lower bounds has proven successful in 
reducing the time needed to answer nearest neighbor queries. 
Wichterich et al. [20] have shown that reducing the number 
of bins in the distributions can successfully prune candidate 
objects from the answer set. Recently, Xu et al. [21] have 
exploited the dual solution of the transportation problem to 
construct a B+ tree over a database of uncertain objects, 
which is then used to speed up query processing. 

The above pruning methods suffer from some significant 
disadvantages. The reduction method proposed in [20] is im- 
plemented in the scan-and-refine architecture (SAR), and 
hence suffers from scalability shortcomings as an index struc- 
ture is not implemented. The index proposed in [21] is con- 
structed using pre computed feasible solutions. The query- 
ing performance is contingent upon finding fcEisible solutions 
that are an accurate reflection of the underlying data set, 
and each feasible solution requires the use of two B-|- trees. 
As the size of the database and the heterogeneity of the data 
increases, the number of B+ trees must also be increased in 
order to keep query times tractable. Hence these two meth- 
ods suffer from scalabiUty problems as the database cardi- 
nality and diversity is increased. 

There is a clear need to develop a tight lower bound to 
the EMD that can be efficiently indexed and scaled to large 
datasets. Cohen et al. proposed a lower bound that projects 
distributions onto a vector, and computes the EMD in 1- 
dimension on the projection [4]. This bound is extremely 
tight and can be computed linearly in the number of bins. 
Figure 1 shows the average lower bound as a percentage 
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Figure 1: Lower bounds on the RETINA data set. 
Each bar plots the average percent of the total EMD 
that the lower bound captures. The projection lower 
bound is a very tight lower bound. 

of the actual EMD on a data set used in [20, 21]. The 
projection bound is highly accurate and a tighter bound 
than many other proposed lower bounds. 

The projection bound [4] has not been utilized in K~ 
NN queries due to the difficulty in indexing the bound. 
The projection bound is equivalent to the Lj distance be- 
tween the cumulative distribution functions (CDF) of the 
1-dimensional discrete distributions. Indexing the projec- 
tion bound is diflicult due to the high dimensionality of the 
Li computation (the number of discrete distribution bins), 
coupled with the use of the Li distance that renders many 
dimensionality reduction methods ineffective. For example, 
the number of bins in the projection bound computation 
cannot be reduced using Fourier decomposition as there is 
no known equivalent of Parseval's theorem for the Li dis- 
tance metric [13]. Hence, the development of a lower bound 
that is nearly as accurate as the projection bound but re- 
quires low Li dimensionality to compute would be extremely 
advantageous, as such a bound could be easily indexed using 
many common index structures. 

In this paper, we propose a new lower bound to the EMD 
that approximates the projection lower bound. Our pro- 
posed bound combines an approximating normal distribu- 
tion and an error term to compute a lower bound to the 
EMD in constant time. We show that this lower bound can 
be embedded into a novel low dimension index space and 
computed in 0(s) time, where s is a small user defined pa- 
rameter. We use the concept of stochastic dominance to suc- 
cessfully prune many potential nearest neighbor candidate 
objects. We test our method on several data sets, including 
a data set of over 600,000 distributions where our method is 
over twice as fast as the previous method. Our contributions 
can be summarized as follows: 

• We develop a new lower bound to the EMD, called the 
normal lower bound. 

• We show that the normal lower bound can be employed 
in a novel low dimension index that is constructed us- 
ing the concept of stochastic dominance. 

• We demonstrate that the normal lower bound signif- 
icantly decreases K-NN query time, and scales well 
with large data sets. 

2. RELATED WORK 

The EMD is closely related to the family of mass trans- 
portation problems [14]. These problems are broadly con- 
cerned with the optimal movement of mass, fiow or probabil- 
ity between two sets of data. The EMD was first introduced 



into the computer vision communities by Rubner et al. [15], 
though it has subsequently been shown to be an effective 
distance metric for many tasks, including comparing images 
[19] and shapes [18]. 

Numerous lower bounds to the EMD have been explored 
since it has been shown to be an effective metric for image 
retrieval and comparison [1, 2, 4, 11, 20]. Several papers 
have explored the use of the EMD in the scan-and-refine 
(SAR) querying architecture [1, 2, 20]. These methods focus 
on deriving accurate and efficient lower bounds to the EMD 
so some false candidates can be pruned away, though they 
suffer from scalability problems without the use of an index. 

Xu et al. [21] has introduced a method that exploits the 
dual solution to the transportation problem to build a lower 
bound B+ tree index structure. This method, called TBI, 
creates several B-|- trees based on a feasible solution to the 
EMD from a fixed data set. The B-|- trees are then used at 
query time to eliminate candidates from a nearest neighbor 
query. The performance of the method is contingent upon 
finding feasible solutions that have high pruning power. Like 
the reduced EMD method, this may be unsatisfactory for a 
database that is changing frequently or composed of het- 
erogeneous data. In addition, each feasible solution imple- 
mented requires two B-(- trees, resulting in poor scalability 
as the diversity and size of the data set increases. 

Indexing the normal lower bound is similar to the more 
general problem of indexing monotonic functions. There has 
been significant research on methods to index general func- 
tions, such as time series or probability distributions. For 
example, Keogh et. al. [9] and Yi and Faloutsos [22] propose 
approximating 1-dimensional time series data by adaptively 
dividing the time series into constant segments. They then 
build an index structure based on such constant segments. 
Ljosa and Singh [12] proposed a similar method to index ex- 
pected distance functions or CDFs using non-constant line 
segments. These methods have serious shortcomings if ap- 
plied to the EMD indexing problem. The methods proposed 
in [9] and [22] approximate arbitrary functions by a constant 
line segment, and hence require large amounts overhead to 
approximate a monotonic CDF. These indexing methods 
also focus on the L2 distance. Many dimensionality reduc- 
tion techniques that are successful for the L2 distance do 
not provide any benefit for the Li distance [3]. 

The method in [12] is more suitable for approximating a 
CDF. However, this method was initially proposed to index 
the distance from a point to a CDF or expected distance 
function. Adapting the index to compute the Li distance 
between two CDFs would prove difficult, as each line seg- 
ment in the approximation would have to be visited to com- 
pute the distance, due to the absolute value. Our method 
approximates CDFs by normal distribution CDFs, which 
are guaranteed to intersect at most once when the variances 
are unequal. In addition, there exists a closed form solu- 
tion to the Li distance between two normal CDFs, enabling 
computation of the distance in constant time. 

3. EARTH MOVER'S DISTANCE 

The Earth Mover's Distance, as defined in [15], measures 
the minimum cost required to transform one histogram into 
another. While the formulation in [15] is generalized for 
histograms, we present a simple definition for distributions. 

Given distribution weights P = (pi,---p„), distribution 
weights Q = (gi, • • • q„), and a set of distribution bin loca- 
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Figure 2: Figure 2(a): 1— dimensional EMD between projected distributions P and Q. The EMD is the 
sum of the shaded areeis of the figure, which is the absolute difference between the two distributions' CDFs. 
Figure 2(b): An overview of the normal lower bound index. Distributions are projected onto a vector. 
The distributions are fit to normal distributions and the errors are pre— computed. The normals are Hough 
transformed, and along with the errors, are used to create bounding regions in a quad— tree. 



tions B = {h\,---hn) that are common to both P and Q, 
the total cost between P and Q is 



Furthermore, if S = {S\, 
axes in R'', d' < d, then 



. Sd' ) is a set of orthogonal 



i=i j=i 

whore fij is the flow between bins bi and bj, and Cij is the 
cost to move flow from bi to bj. The choice of the cost 
is determined for each specific ijroblcm. For simplicity, we 
subsequently restrict our discussion to the L2 norm, though 
the method can be applied to other cost structures. The 
EMD is then defined as 

EMD{P,Q) = min F{P,Q), subject to: 

n n 

fij > and ^ fij < Pi and ^ fij < qj 

i=i i=i 

The EMD is the minimum cost needed to transform one 
distribution into another. The constraints ensure that the 
flow out of a bin is nonnegative and not more than the total 
weight in each bin. 

The solution to the EMD attempts to minimize the dis- 
tance that the weights pi must move to equal the weights g^. 
When the distance between bins bi and bj is small, the solu- 
tion will maximize the amount of flow between the bins; if 
large, it will minimize the amount of flow between the bins. 

In general, solutions to the EMD use algorithms from lin- 
ear programming, such as the transportation simplex [7]. 
The EMD has an empirically observed time complexity of 
approximately 0{n^) [15], where n is the number of bins in 
distribution. Hence, even small problem sizes can require 
significant time to compute. 

3.1 EMD Projections 

We briefly detail the EMD projection lower bound, as 
shown in [4]. If Sj is a unit vector in R"*, then 

EMD{P,Q) > EMD{projsj{P),projsj{Q)) 

where projsj (P) is a projection of P onto the vector Sj (and 
similarly for Q). That is 

WOjSj{P) = (Pi • ■■Pn,tl, . . .in), ti = Sj ■ bi 

where the weights are denoted as pi, and the projected bins 



EMD{P, Q)>—Y^ EMD{projs, {P),projs, {Q)) 

The EMD between P and Q is lower bounded by the sum 
of the EMDs on a set of orthogonal vectors, divided by the 
square root of the number of vectors. The EMD on a single 
projection can be computed in 0{n) time, a much more effi- 
cient computation than the full EMD. The following method 
of computing the EMD along a 1-dimensional projection is 
from [4]. 

Theorem 1. Let Cp^Sj{t) denote the CDF of a discrete 
distribution P projected onto Sj. Then given Cp^Sj{t) and 
CQ,Sj {t), the EMD on the projection is 



Cp,s,{t)-CQ^Sj{t) 



EMD(projsj {P),projsi (Q)) =^ 

where tmin and tmax are the minimum and maximum pro 
jected bin values. 

Proof. See [4], Theorem 4. □ 



dt 



Figure 2(a) shows the CDFs of two distributions on a 
projection. The EMD is the sum of the shaded areas in the 
figure, since that is exactly equivalent to the difference in 
the CDFs of the two distributions. 

The computation of the EMD in this manner essentially 
amounts to the Li distance between two n-dimensional vec- 
tors. Note that this Li distance between the CDFs is not 
related to the cost to move fiow between the original bins. 
Computing the projection EMD may be efficient compared 
to the cubic time requirement of the full EMD, but the Li 
distance between such large vectors is not easily indexed, 
due to the curse of dimensionality and the difficulty of di- 
mensionality reduction for such a distance [3]. Thus, we 
introduce a new lower bound based on normal distributions 
that is easily indexed. 

4. NORMAL LOWER BOUND 

Figure 2(b) presents a simple overview of the normal lower 
bound and the subsequent indexing method. The bound 
utilizes the 1-dimensional projection EMD. There are two 
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components to the normal lower bound; a distance between 
normal approximations to the projected distributions, and 
accompanying errors of the approximations. The normal ap- 
proximations are subsequently transformed into points using 
Hough transformations, and combined with the error terms 
to create bounding regions in this transformed space. We 
detail construction of the index in Section 5. Throughout 
this section and next, we assume that each distribution is 
projected onto a single vector, hence we omit the projection 
subscript Sj. We will detail integration of multiple projec- 
tion vectors in Section 5.3. 

4.1 Normal Approximation 

Given a 1-dimensional projected distribution P, we ap- 
proximate P by a normal distribution M{iip,Op), where 
and (Tp are the mean and variance of the projected distribu- 
tion P. 

Definition 1. Normal CDF Integration Let N{fip,ap) 
be a normal distribution, and let $p be the CDF of the 
normal. Then the area under the CDF of the normal in the 
tmax] is defined as 



range of ^tmm , ''max 

rt 



$p(t) dt = t- $p(t) + (j>p{t) 



where (jip is the probability density function of a normal 
distribution. 

Definition 1 is well known from the integration of the er- 
ror (erf) function, which is equivalent to the CDF of the 
standard normal distribution. Despite the absolute value 
in Theorem 1, computing the EMD between 1-dimensional 
normal distributions does not require any numerical integra- 
tion. As shown by Sinha and Zhou [17], two normal CDFs 
with unequal variances intersect at most once, and there is 
a closed form equation to determine the intersection point. 
Let tis be the intersection point between $p and Then 



(1) 



Given the intersection point and the range of integration 
[tmin,tmax\, we deuote the normal EMD as EMD^f{^p , $q), 
which is defined by rewriting Theorem 1 for normal distri- 
butions as 



SMDaa($p,<E'q) 



4- 



<I>p(t) dt - 
^p(t) dt - 



$Q{t)dt 
^Q{t)dt 



If tis lies outside the range [tmin,tmax], only one inte- 
gration is performed over the entire range. Computing the 
normal EMD is a constant time operation due to the closed 
form of the normal CDF integration in Definition 1. 

Next we briefiy introduce the concept of stochastic dom- 
inance. This concept is integral to computing the lower 
bound, as well as for the subsequent indexing of the bound. 

4.1.1 Stochastic Dominance 

Stochastic dominance is the concept that in some fixed 
interval, the CDF of one distribution is always less than 
the CDF of another distribution [10]. Formally, we define 
stochastic dominance as 



Underestimated 
Area 




Figure 3: The intersection of two normals and the 
CDF of one of the normals. The EMD between the 
normals has underestimated the actual projected 
EMD by the area in blue, and overestimated the 
EMD by the amount in red. This scenario assumes 
that Errq is zero. 

Defimtion 2. Stochastic Dominance. A CDF Cp stoc- 
hastically dominates another CDF Cq if 

Cp (t) < Cq (i) V t G [tmin , tmax] 

If Cp dominates Cq, we write Cp < Cq. 

Note that varying definitions of stochastic dominance ex- 
ist, but we use this definition in this work. Observe that 
two distributions may not dominate each other. In such a 
scenario, the two CDFs must intersect at least once, and in 
the case of normal CDFs, exactly once. We define stochastic 
dominance only in a fixed range [tmin, tmax], as outside this 
interval the dominance property is not guaranteed. 

Stochastic dominance provides several properties that as- 
sist in computation of the normal lower bound. First, if 
-< in [tmin, tmax], thcu we kuow that the integration 
of <l>p is less than "I>q in the same range. In addition, if <l>p 
intersects <1>q in the range [tmin, tmax], then we know that 

$F ^ over [tmin, tis] 

-< over [tis, tmax] 

is true, or the converse is true, where tis is the intersection 
point between $p and $q. 

4.2 Error Compensation 

In order to take advantage of the constant time distance 
computation between normals, we must also compensate for 
the error incurred when each CDF is fit to a normal. 

Defimtion 3. Normal Approximation Error. For a 

normal CDF <l?p, we define the normal approximation error 
Errp{t) incurred at point t as 



Errp{t) = Cp{t) - $p(t) 



(2) 



If at point t the CDF of P is greater than the normal ap- 
proximation of P, we have a positive error, and a negative 
error in the reverse scenario. 

Unfortunately, the errors cannot simply be accumulated 
over the entire distribution range and combined with the 
normal EMD to produce the lower bound. This is a result of 
the intersection point between two distributions impacting 
how the errors compensate for the normal approximation. 
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(a) (b) (c) 

Figure 4: Three different intersection points in a sub— interval. The difference between the errors before 
and after an intersection point is needed to compute the normal lower bound. Hence these values are 
pre— computed for a sub— interval, and the minimum and maximum differences are recorded. 



Consider the example in Figure 3, which shows two nor- 
mal approximations <l?p and <1>q, and the CDF Cp. Let us 
assume that is exactly normal, meaning Ettq is zero ev- 
erywhere. Before the intersection point of the normal CDFs, 
$p dominates $q. Therefore, the normal EMD between $p 
and $Q is an underestimate of the projected EMD by the 
amount of negative error in Errp before the intersection, 
shown in blue. Conversely, after the intersection point, the 
dominating relationship has changed. Therefore, the neg- 
ative error is now the amount that the normal EMD has 
overestimated the projected EMD, shown in red. 

Observe in the figure that after the intersection, the pos- 
itive and negative errors will cancel each other out, as the 
positive error is an underestimate while the negative error 
is an overestimate. The entire error after the intersection 
point can be summed together and will be close to zero, 
since the positive and negative areas after the intersection 
are roughly equal. This means that for the example in the 
figure, the projected EMD is at least 

EMDu{^p,^q)~ [ "Errp{t)dt+ f Errp{t)dt (3) 

Note that if <1?q dominated t&p before the intersection, we 
would reverse the signs on the error terms. 

We cannot pre-compute the error integrals in Eqn. (3) to 
compute a lower bound to the projected EMD at query time. 
The integrals in the equation depend upon the intersection, 
and obviously we do not know in advance the intersection 
point between a query and database object. However, in 
lieu of knowing the actual intersection point from a query, 
we break the interval [tmin,tmax] into s sub-intervals, and 
compute a pessimistic estimate of the error summations in 
each sub-interval. 

We visit each potential intersection point in a sub-interval 
and pre-compute the worst case error terms. For example. 
Figure 4 shows one sub-interval with three different poten- 
tial intersection points. At each intersection point the dif- 
ference between the total error before the intersection and 
after the intersection is computed, as in Eqn. (3). In each 
figure, the error in the blue region is accumulated and sub- 
tracted from the accumulation of the error in the red region. 
We then store the minimum and maximum of these values, 
as we do not yet know the sign on the error terms until 
query time (if the error is subtracted we will need the max- 
imum, and the minimum if the error is eventually added). 
The only intersection points we need to consider are where 



the discrete CDF changes or intersects its normal approxi- 
mation, so there are a finite amount of intersection points 
that need to be checked. 

The minimum and maximum error differences are pre- 
computed for all s sub-intervals, which we denote as Errmin,p 
and Err max, p- These error differences can then be easily 
retrieved via a constant lookup at query time and either 
added or subtracted to the normal EMD, depending on how 
the two normals intersect. We perform the same calculation 
of the minimum and maximum error for each query as well 
(at query time, of course). 

Formally, we define Errmin,p as follows 

Errmin.p{tj) = 

min {C" Errp(t)dt - C""''' Errp{t)dt} 
tiselsi.si+i] " 

\ if Si < tj < Si+i 

Jt^'J Errp{t) dt if tj < t^in \j tj > tmax 

with Errmax,p{tj) defined in the same manner but with a 
maximization instead of a minimization. 

4.3 Computing the Normal Lower Bound 

Using stochastic dominance, we can now compute the nor- 
mal lower bound between two distributions on a projection. 
We denote the normal lower bound EMDlb{^p, ^q) in the 
range [t 

EMDlb{^p,^q) = 
' EMDj^{ii>P, <I>q) + Errmin,p{tia) - Errmax.gitis) 

if $Q ^ $P V > O-q 

EMDj^{^P, $q) - Errmax,p{tia) + Errmin.,Q{tis) 

if $p ^ <E>Q V < O-q 

where tis is the intersection point between <E>p and <I>q. 

Theorem 2. 

EMDlb{^p,^q) < EMD(projs^{P),projs^{Q)) 
Proof. See Appendix A. □ 

Computing the normal lower bound is a constant time op- 
eration. With a given query, the EMD between the normals 
is computed using the closed form definite integral. The in- 
tersection point is then computed using Eqn. (1), and the 
pre-computed error terms are retrieved and the bound is 
then computed. 
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5. INDEXING THE LOWER BOUND 

The normal lower bound is a tight lower bound to the 
EMD. However, despite the 0(1) complexity of the bound, 
computing the lower bound on a large database of objects 
can still be very time consuming. An index structure based 
on the normal lower bound would be desirable to handle 
scalability of the data. 

There are several challenges with the bound that must be 
resolved in order to build an effective index on this distance 
metric. First, we must define the space and structure of 
the index. Meaning, we must determine how each projected 
distribution will be represented in the index space. Second, 
we must ensure that computing the normal lower bound 
in the index space is still efficient. We now detail how we 
resolve both of these challenges. 

5.1 Dominance Space 

Normal distributions have the desirable property that st- 
ochastic dominance is preserved in lines that are computed 
using the mean and variance of the distributions. 



Theorem 3. For t e [t 



mini ^max] ? 

t 



Proof. As shown in [17], <!>(- 



is less than $( 



< 



(where 3> is the standard normal CDF) if and only if ■ 
^— The definition of dominance states that function A 
must be strictly less than function B in the given interval. 
Thus < satisfies the definition of dominance and 
the theorem is true. □ 

In order to take advantage of this property we perform 
a Hough transformation [8] on the normal CDFs. ffough 
transformations convert line segments into points in the pa- 
rameter space of the line segments. Using the standard y- 
intercept form of a line, y = m ■ t + b, each normal is rep- 
resented by a slope (rn) and a y-intcrcept (6). We rewrite 
—J^ into the standard y-intercept equation for line. We set 

m = ^ and b = Each normal CDF $p is represented 
as a tuple {mp,bp) in this transformed space. We term this 
transformed space dominance space, because it is easy to 
define dominance relationships in geometric terms. 

Consider Figure 5(a), where we plot y = m-t+b for several 
normals. We denote the blue line as the line for $p, which 
dominates all the black lines. Additionally, the red dotted 
line has no dominating relationship with any of the other 
normals. We observe that at tmin (the smallest t value) the 
dominance relationship is preserved. That is. 



t- 



Hq 



+ bp <mq ■ tmin + bq (5) 



for some $q that is dominated by $p. There exists a region 
in dominance space such that at tmin, the normals defined 

by points in the region are always greater than $p at tmin- 
This region can easily be derived from Eqn. (5) as 

rrip ' tmin ~t~ bp ^ TTlq ' tmin ~t~ bq , V fllq , bq 
{jTlp'tmin + bp) — niq-tmin < bq, W TUq (6) 

tmin'nriq -\- (^TKlp-tmin ~t" &p) ^ bq , ^ TTiq 

That is, we have defined a line in dominance space, with a 
slope of —tmin and a y-intercept of {nip ■ tmin + bp), such 



that all normals with a bq value above the line arc greater 
than $p at tmin- This can be seen in Figure 5(b) as the 
dotted line. The points in dominance space for the black 
lines in Figure 5(a) are all above the dotted line in Figure 
5(b) because at tmin, they are all greater than P. 

Similarly, we can find another region using tmax, as shown 
in Figure 5(c). Note how the red line from Figure 5(a) that 
intersected all the other lines is not above this new dotted 
line, as at tmax, it is less than $p. We denote these lines in 
dominance space as dominance lines. 

From Theorem 3 and Eqn. (6) we get a corollary regarding 
the dominance lines. 

Corollary 1. For t e [tmin, tmax], 

\ bq ^ tmin ' Tflq -|- {fflp ' tmin ~\~ bp) 
t — Hp t — fiq I , 

— -< — <^ { and 

(^p Cq \ , 

I bq ^ tmax ' niq yTTLp - tmax ~t~ bp ) 

Corr. 1 implies that given a point (nip, bp) in dominance 
space, there exists a region where <I>p dominates all other 
normals. Conversely, if we switch the inequalities in Corr. 
1, we can determine the region where $p is dominated by 
all normals, which is the area below the intersection of the 
two lines in Figure 5(c). Any points in dominance space 
not contained in these regions must intersect $p somewhere 
between tmin and tmax- 

Observe that the slope of the dominance lines arc —tmin 
and —tmax- As tmin and tmax are fixed for the databsise 
(since they are based on the projection), all dominance lines 
have the same slope, and thus are parallel. The shape of 
this region is the same no matter where the point (nip, bp) 
of interest is placed. 

Finally, we state the following theorem about dominance 
space. 

Theorem 4. Denote the region that (mp,bp) dominates 
as R, so that $p ^ $r Vr € R- Then for all points (rUq, bq) 
where $q ■< ^p, 

EMDm(^p, $q) < EMDj^(^r, $q), Vr G 7? (7) 
Proof. This follows from the definition of dominance. If 
$p -< then <bp must be less than <br at every point in 
the range, and hence the area under the CDF for $p must 
be less than Since $q ^ <I>p, the area under $q must 
also be less than <l?p, and hence EMDj^{<^p,^q) must be 
less than any EMDj\r(^r, ^q)- □ 

The index is implemented in dominance space, and each 
projected distribution is represented as a point in this space. 
Next we detail how dominance lines and Theorem 4 are uti- 
lized to lower bound a set of points in dominance space. 

5.2 Bounding Regions 

An efficient index structure needs to have the capability of 
pruning large amounts of potential candidates from a K-NN 
query without resorting to computation of the lower bound 
between a query and each database object. This can be ac- 
complished by creating bounding regions around points in 
dominance space, and computing the lower bound between 
a query and a bounding region. If the lower bound between 
the query and the bounding region is too large to be con- 
sidered as a nearest neighbor, then all the points within the 
bounding region can be pruned with a single computation. 
The bounding regions that are utilized in the index are based 
on dominance lines. 
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All normals above 
this line are greater 
than Pat t. 
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Must intersect 

all of ttie 
other normals 



All normals above 
this line are greater 
than P at t^^ 



(a) (b) (c) 

Figure 5: Finding dominance lines in dominance space. At the endpoints of the Hnes in Figure 5(a), the 
dominance property must hold. All points in dominance space that have their endpoints dominated by the 
blue line must be above the two dominating lines in dominance space. Note that the red point has only one 
endpoint dominated by the blue line, hence it is only above one of the dominating lines in Figure 5(c). 






(a) Complete Dominance 



(b) Partial Dominance 



Slope (m) 

(c) No Dominance 



Figure 6: A bounding region in dominance space, and examples of the three scenarios where the distance 
from a point Q to a bounding region must be considered. In Figure 6(a), Q is completely dominated by 
every point in M. In Figure 6(b), Q is partially dominated by A/, and in Figure 6(c), Q has no dominating 
relationship with Al (it must intersect every point in AI). Note that every point Q must fall into one of these 
scenarios, though the positions in dominance space may be different than depicted in the figures. 



Definition 4- Bounding Region. Given a set of points 
in dominance space, we define the bounding region (BR) AI 
of the points by two points Adi and Adu in dominance space, 
where 

Ml = {mi,bi) : $m, -< $m, V e M 
AIu = (niujbu) ■■ "I>mi -< \/ nue Ad 

Recall from the previous section that the slope of the dom- 
inating lines are solely defined by \tmin,tmax\- As a result, 
only two points are needed to define a bounding region, since 
the other two points can be derived from the intersection of 
the dominating lines emanating from M; and M^. This is 
similar to Euclidean space where only two points are needed 
to define a rectangle. 

For the moment, we assume that tmin < and tmax > 0. 
In such a scenario, M forms a diamond shaped bounding re- 
gion in dominance space. Figure 6(a) shows the dominating 
region for a series of points shown in blue. The bounding 
region is the area enclosed by the solid black lines, with AIu 
being top of the diamond region, and Mi the bottom. 

We can easily compute the normal lower bound between 
a query Q and all rrii £ M. However, in order to prune 
all the points in M, we need to compute the normal lower 
bound between Q and the BR Ad. Unfortunately, comput- 
ing a lower bound to a bounding region is not as simple to 
compute as the normal lower bound. 



First, we define the minimum and maximum error differ- 
ences for a BR similar to a point as 

Errmax,M{tis) = max Errmax,mi{tis) 

Meaning, the errors of some sub-interval Si for a bound- 
ing region are just the minimum and maximum of the sub- 
interval Si for any point within M. 

We denote the normal lower bound specifically for bound- 
ing regions as EMDsRiAd, ^q). We must show that 

EAdDBR{M, $q) < £ML»z,s($„,,$q) Vm, e M 

in order to prune a bounding region from a candidate set. 

There are three different ways that EMDbr{M, $q) is 
computed, depending on where the query point is relative 
to M. An example of each scenario is shown in Figures 6(a), 
6(b) and 6(c). Similar to the normal lower bound between 
points, the lower bound for bounding regions is composed 
of a normal EMD and error terms. The normal EMD term 
for EMDbr{M, $q) is the minimum normal EMD between 
Q and any point in M. Determining this minimum distance 
point greatly depends on where Q is relative to M, hence 
the three different ways that EMDBR{Ad, $q) is computed. 
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5.2.1 Complete Dominance 

The complete dominance scenario is shown in Figure 6(a). 
In this scenario, Q is completely dominated by the entire 
BR. 

From Theorem 4, we know 
*A/„ ^ 'I'Q => 

Mu is the point in M that has the minimum normal distance 
to Q. EMDbr{M,$q) is then computed as 

EMDbr{M,^q) = 

EMD^f{$M^,^Q) - Err,„ax,M{tis) + Errmin,Q{Us) 

Note that if Q ^ Mj, we use Mi instead of Alu and the 
appropriate minimum or maximum errors. 

5.2.2 Partial Dominance 

The partial dominance scenario is depicted in Figure 6(b). 
In this case, the query point Q is dominated by Mi but not 
Mu, and as a result, we cannot use Theorem 4 to find the 
minimum distance normal. 

Fortunately, since the EMD is a metric, we can use the tri- 
angle inequality to lower bound the minimum normal EMD 
from Q to Af . We denote the intersection point of the dom- 
inating line from Q with M as Mis, shown as the red point 
in Figure 6(b). We know that the minimum distance point 
must lie above the intersection of the dominance line from 
Q with M (red in the figure), as follows from the complete 
dominance scenario in the previous subsection. 

We define EMDbr{M, $q) in the partial dominance sce- 
nario as 

EMDbr{M,^q) = 

\ [EMDm{^m^,^q) + EMDm{$m,^,,^q) 

(8) 

~EMD^f{^^^,,'^MJ] - max{Errmax,M{si)} 
+ min{Err„^i„^Q{si)} 

The triangle inequality is used to bound the minimum nor- 
mal distance, and the maximum and minimum errors over 
all sub-intervals from M and Q are used to ensure that the 
error terms are always less than any errors from a point in- 
side M. If Q ^ Mu but not Mi, then we use Mi instead of 
Mu and the appropriate minimum or maximum errors. 

Theorem 5. If a query point Q is dominated by Mi but 
not Mu, then 

EMDbr{M,^q) < EMDLB{^m,,^Q)'im^ e M 

where EMDbr{M,^q) is computed as defined in Eqn. (8). 

Proof. See Appendix (B) □ 

5.2.3 No Dominance 

The last scenario is depicted in Figure 6(c). In this case, 
Q has no dominance relationship with any points in AI. 
This case is just an extension of partial dominance. We de- 
note Mis as the intersection point between the dominance 
lines from Mi and M„, as depicted in red on the figure. 
We use the triangle inequality to bound the minimum dis- 
tance normal using Mi and Mis, then repeat using M„ and 
Mis , and take the minimum value. The rest of the terms in 
EMDbr{M,$q) for partial dominance are the same. 
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0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 

Slope <m) 

Figure 7: Quad— tree of bounding regions for a data 
set. The blue points are the database objects, and 
the black lines are dominating bounding regions. 

We have now defined the normal lower bound between 
points in dominance space, as well as a lower bound that is 
used between a point and a bounding region. We can now 
group points in dominance space, and prune large amounts 
of candidate objects with only a single lower bound compu- 
tation. Computing the distance from Q to a BR AI requires 
finding the minimum/maximum over all the error terms, and 
with s sub-intervals, is an 0(s) operation. 

5.3 Index Implementation 

The index structure implemented is a quad-tree. Quad- 
trees provide a simple method to define bounding regions, 
since the regions are based on parallel dominating lines. We 
can also always ensure that we have a diamond shape to the 
BR regions by subtracting the mean of the projected bins. 
This ensures that tmin < and tmax > 0, and has no effect 
on the EMD as it is translation invariant. 

Given a data set and a projection Sj, we project each 
distribution onto Sj and fit a 1-dimensional normal distri- 
bution to each projection. We pre-compute the errors in 
each of the s sub-intervals for each distribution. For sim- 
plicity, we evenly divide the range from [t,nin,tm.ax] into s 
sub-intervals. Finally, we convert each normal distribution 
to a point in dominance space, determine the bounding re- 
gion of the entire database, and recursively build the tree. 
Figure 7 shows an example of the quad-tree for a small sam- 
ple data set. 

At query time, we find potential nearest neighbors by per- 
forming a best-first traversal of the quad-tree that utilizes 
a threshold based on the current fc*'' nearest neighbor. Due 
to space considerations, we refer the reader to [16] for more 
details on quad-trees and nearest neighbor querying using 
a best-first method. The space complexity analysis of the 
index is discussed in Appendix (C) 

Multiple projections may yield a tighter lower bound than 
a single projection. If more than one projection is used, we 
implement each projection as its own independent index. 
We then use a modified version of Fagin's Threshold Algo- 
rithm [5] to aggregate the lower bounds from each index. 

6. EXPERIMENTS 

To the best of our knowledge, the TBI method proposed 
in [21] is the state of the art method for nearest neighbor 
queries with the EMD. The existing scan-and-refine meth- 
ods do not implement an index structure, and must perform 
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a large sort of the database lower bounds for each query. Im- 
plementation of a lower bound index is more efficient than 
a single sort on the entire database, as the TBI method 
demonstrated nearly twice as much improvement in K-NN 
query time over the previous SAR methods. Therefore, we 
tested our method only against the TBI method. 

As much of the experimental procedure as possible from 
[21] was replicated in our experiments. The TBI method- 
ology uses an index structure and three additional lower 
bound filters that are run after candidates are pulled from 
the index. The filters are run in order: the dual solution 
lower bound, the reduced EMD lower bound, and the inde- 
pendent minimization lower bound. If a candidate passes 
all lower bound filters, the full EMD is performed. The im- 
plcniciitation of the TBI method was the author's original 
implementation, with some slight modifications to keep all 
indices in memory for fair comparison with our method. 

The normal lower bound index was coded in C++ and 
placed into the TBI framework, without the TBI index. The 
normal lower bound index uses the described index struc- 
ture(s), then the full projection lower bound, then the re- 
duced EMD lower bound and independent minimization as 
in the TBI method. All testing was performed on an Intel 
XEON Quad core processor at 2.33Mhz, with 4GB of RAM. 

6.1 Data sets 

We tested our new method on three real data sets. The 
first data set is the RETINA data from [11, 20, 21]. This 
data set consists of 12 MPEG-7 descriptors of 3932 images 
of the retina. Only the first set of descriptors was used in 
the experiments. Each descriptor is an 8 x 12 grid of tiles. 
The descriptors were normalized, giving 3932 2-dimensionaI 
distributions with 96 total bins. The query objects used for 
this data set were the same 100 objects used in [20]. 

The next data set used was the IRMA data set from [20, 
21]. This data set consists of features extracted from 10,000 
medical images. Each distribution is 199 bins over a 40 
dimensional space. The same 100 objects from [20] were 
used to query the database. 

Finally, we used a data set consisting of 681,278 publicly 
available images from the photo sharing site Flickr. The 
data set is a collection of general images, each re-sized to be 
a 640 X 640 image. Each image was divided into a 10 x 10 grid 
of tiles, and the 12 feature MPEG-7 color layout descriptor 
(OLD) was extracted from each tile in an image. Extracting 
a OLD for each tile achieves a more accurate representation 
of the spatial distribution of color in an image than using 
a single OLD for an entire image. The CLDs from each 
image arc then normalized to sum to one. This converts 
the 100 spatial CLDs to a 2-dimensional probability distri- 
bution over the tiles of an image. Each image distribution 
measures the relative value of the CLD in one tile compared 
to the CLDs in other tiles of an image. Two images that 
are close in EMD have similar distribution of CLDs. The 
CLD distributions model the inherent uncertainty of color 
features and image comparison. Like the RETINA data, 
only the first set of descriptors were used. 100 distributions 
were randomly selected and removed from the data set for 
querying. 

Feasible solutions for the TBI method were generated for 
all data sets in the random manner as described in [21], and 
four feasible solutions were employed for each data set, the 
same as in [21]. For the RETINA and IRMA data sets, the 



reduction matrices were the same matrices as used in [20]. 
The reduction lower bound was not used on the Flickr data 
set due to the generality of the data. That is, reduction 
matrices provided no pruning power, and actually resulted 
in a performance decrease due to the overhead of running a 
reduced EMD. For all tests and data sets, the L2-norm is 
used as the distance in the cost matrix. 

The projection vectors for each data set were found via 
principal component analysis. As the RETINA and Flickr 
data sets are two dimensions, we used both principal com- 
ponents as our projection vectors, meaning we utilize two 
indices. The IRMA data contains almost all variance in one 
principal component, hence we only employed one projection 
vector for the IRMA data set. It is possible that there exist 
better projection vectors to use on these data sets. However, 
we demonstrate that even using a simple technique such as 
PGA can yield impressive results. 

6.2 Results 

The only parameters for our tests that we need to vary 
are the number of sub-intervals, and the node capacity in 
the indices. We explore how these parameters effect the 

performance of the method, but we set the default values of 
these parameters as approximately log(n) number of sub- 
intervals (5 for RETINA and Flickr, 6 for IRMA) and a node 
capacity of 100 objects, where n is the number of bins in the 
data set. 

6.2. 1 K-NN Query Time and Number ofEMDs 

Figure 8 presents the query time for each method while 
varying K. For the RETINA and IRMA data sets, we 
present results with the reduction matrix using the origi- 
nal values as presented in [20, 21] (18 and 60, respectively), 
and with increased values (36 and 80, respectively). As will 
subsequently be shown, our method significantly decreases 
the number of full bin EMDs that need to be performed. As 
a result, we must increase the dimensionality of the reduced 
filter, or computation time will be wasted on a lower bound 
filter that has no pruning power. 

As one can see, using the original reduction matrix our 
method reduces the query time. When we take advantage of 
the available increase in pruning power of the reduction me- 
thod, the query time for the normal lower bound index de- 
creases even more. As we increase K, we maintain the speed 
up advantage over the TBI method, and on the RETINA 
and Flickr data sets we see significant improvement as K 
increases. 

On the Flickr data set, the speedup is significant. This 
speed up is the result of the index's ability to prune many 
potential candidates before running the other lower bound 
filters. This can clearly be seen in Figure 9, where we plot 
the number of full bin EMDs performed. The improvement 
in the rmmber of full EMDs performed generally follows the 
pattern of the time speed up. However, on the Flickr data, 
the reduced lower bound method is useless due to the hetero- 
geneity of the data set. Therefore, the TBI method suffers 
significant performance degradation without this extra lower 
bound filter. Our method still successfully prunes many po- 
tential candidates without needing the reduced filter, hence 
we demonstrate its superiority over the TBI method with 
extremely large data sets. 

In addition, the normal index does not require nearly as 
many data accesses as the TBI method. Recall that the TBI 
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(a) RETINA, 100 Queries (b) IRMA, 100 Queries (c) Flictir, 100 Queries 



Figure 8: Average K— NN query time for all data sets. K values up to 8 are shown in the insets for the 
RETINA and IRMA data sets. The normal lower bound index outperforms the TBI method on all data 
sets. The index also enables increased pruning power of the reduced method since the number of full EMDs 
is drastically reduced. On the Flickr data set, the normal index performance over TBI is significant. Both 
methods were unable to use the reduced filter, yet the normal index still retains high pruning power. 




(a) RETINA, 100 Queries (b) IRMA, 100 Queries (c) Flielcr, 100 Queries 

Figure 9: Average number of full EMDs performed. The normal index performs very few full bin EMDs, 
and scales better as K and the database cardinality increases, as demonstrated in the Flickr data set results. 
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Figure 10: Average K— NN query time while varying 
the node capacity. The node capacity has very little 
impact on query time for any of the data sets. 



method must implement two B+ trees for each feasible so- 
lution that is used. Hence, with four feasible solutions, TBI 
implements eight B-1- trees. This results in an extremely 
high data access rate. The TBI method can access more 
candidates than the cardinality of the data set, due to the 
multiple indices accessing the same objects multiple times. 
The normal lower bound index uses only as many indices as 
projections, requiring less access to the data than the TBI 
method. For example, on the Flickr data set, the TBI me- 
thod accesses nearly four times as much data as the normal 
index (not shown). 

6.2.2 Index Parameters 

We next vary the index parameters to determine what 
effect they impart on the performance of the method. We 



vary the number of sub-intervals from 1 to 9, and the node 
capacity of the index from 100 to 400. 

In Figure 10, we examine how changing the internal node 
capacity in the quad-tree affects the overall performance. 
As we can see, there is no detectable performance differ- 
ence. The lower bound computation performed in the index 
is an 0(s) computation for internal nodes, and an 0(1) com- 
putation for leaf nodes. Therefore, given the large amount 
of leaf nodes in the index and a low value of s, changing the 
relatively small number of internal nodes does not affect the 
overall query time significantly. 

In Figure 11(a), we vary the number sub-intervals for all 
three data sets. We see that increasing the number of sub- 
intervals decreases the query time. On the RETINA and 
IRMA data sets, there is a slight decrease. On the Flickr 
data set, the decrease is more significant, as can be seen 
in more detail in Figure 11(b). In Figure 11(b), as K in- 
creases, the difference in query time with more sub-intervals 
becomes significant. Notice that there are diminishing re- 
turns from increasing the number of sub-intervals; as s in- 
creases, we get a more accurate lower bound, but eventually 
it has little impact on the final query time. Finally, we show 
in Figure 11(c) the number of candidate objects that are 
eventually passed from the index to the other lower bound 
filters. As s increases, there is a significant decrease in the 
number of candidates that are passed to the other lower 
bound filters. This is expected, as increasing s tightens the 
normal lower bound in the index. As the database size in- 
creases, a can be increased to increase the pruning power of 
the bound. 
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Figure 11: Effect of the number of sub— intervals s. Figure 11(a) varies the number of sub— intervals for all 
three data sets with K = 4. The number of sub— intervals has a small impact on the RETINA and IRMA 
data sets, but a large impact on the Flickr data. Figure 11(b) varies K for different values of s on the Flickr 
data set. When K gets large, the impact of the number of sub— intervals is significant. Figure 11(c) shows the 
number of candidates that pass the index lower bound for the Flickr data. As s increases, the normal bound 
in the index becomes tighter and less candidates are passed to the subsequent lower bound filters. 



7. CONCLUSION 

In this paper, we have presented a novel method for near- 
est neighbor queries using the Earth Mover's Distance. We 
introduce a new lower bound to the EMD that approxi- 
mates the projected EMD using normal distributions and 
error terms. We develop a novel indexing scheme using st- 
ochastic dominance to improve scalability of nearest neigh- 
bor queries. We show that our method can retrieve nearest 
neighbors significantly faster than previous methods. 

We anticipate expanding and investigating more aspects 
of the normal lower bound index. For example, the index 
requires no use of the projected distribution bins, outside 
of the minimum and maximum bin values. Therefore, our 
method is suitable for databases that contain objects with- 
out static bins, such as shape databases or moving object 
databases. 

The EMD is a computationally expensive distance metric, 
but the accuracy and quality it produces in querying and 
mining makes it highly attractive. As uncertain data sets 
become larger and more prevalent, extremely high quality 
and flexible pruning methods will need to be developed to 
keep query times tractable. We believe our normal lower 
bound index is a significant step towards achieving these 
goals. 

8. ACKNOWLEDGMENTS 

We thank Jia Xu for providing the code to the TBI me- 
thod and Marc Wichterich for providing some of the data 
sets. This work was supported by NSF grant #0808772. 

9. REFERENCES 

[1] I. Assent, A. Wenning, and T. Seidl. Approximation 
techniques for indexing the earth mover's distance in 
multimedia databases. In ICDE, page 11, April 2006. 

[2] I. Assent, M. Wichterich, T. Meisen, and T. Seidl. 
Efficient similarity search using the earth mover's 
distance for large multimedia databases. In ICDE, 
pages 307-316, 2008. 

[3] B. Brinkman and M. Charikar. On the impossibility of 
dimension reduction in 11. Journal of the ACM, 
52(5):766-788, 2005. 



[4] S. D. Cohen and L. J. Guibas. The earth mover's 

distance: Lower bounds and invariance under 

translation. Technical report, 1997. 
[5] R. Fagin, A. Lotem, and M. Naor. Optimal 

aggregation algorithms for middleware. In ACM 

PODS, pages 102-113. ACM, 2001. 
[6] K. Grauman and T. Darrell. Fast contour matching 

using approximate earth mover's distance. IEEE 

CVPR, 1:220-227, June-2 July 2004. 
[7] F. S. Hillier and G. J. Lieberman. Introduction to 

Mathematical Programming. McGraw-Hill, 1990. 
[8] P. Hough. Method and means for recognizing complex 

patterns, Dec. 18 1962. US Patent 3,069,654. 
[9] E. Keogh, K. Chakrabarti, M. Pazzani, and 

S. Mehrotra. Locally adaptive dimensionality 

reduction for indexing large time series databases. In 

ACM SICMOD, pages 151-162, 2001. 
[10] H. Levy. Stochastic dominance and expected utility: 

survey and analysis. Management Science, pages 

555-593, 1992. 
[11] V. Ljosa, A. Bhattacharya, and A. Singh. Indexing 

spatially sensitive distance measures using 

multi-resolution lower bounds. In EDBT, pages 

865-883. Springer, 2006. 
[12] V. Ljosa and A. Singh. Apia: Indexing arbitrary 

probability distributions. In ICDE, pages 946-955. 

IEEE, 2007. 
[13] A. Oppenheim, R. Schafer, and J. Buck. 

Discrete-Time Signal Processing. Prentice Hall, 1999. 
[14] S. Rachev and L. Ruschendorf. Mass Transportation 

Problems: Theory. Springer series in statistics: 

Probability and its applications. Springer, 1998. 
[15] Y. Rubner, C. Tomasi, and L. J. Guibas. The earth 

mover's distance as a metric for image retrieval. Int. 

J. Comput. Vision, 40(2):99-121, 2000. 
[16] H. Samet. Foundations of multidimensional and 

metric data structures. Morgan Kaufmann, 2006. 
[17] D. Sinha and H. Zhou. Statistical timing analysis with 

coupling. , IEEE Transactions on Computer-Aided 

Design of Integrated Circuits and Systems, 

25(12):2965-2975, 2006. 
[18] S. Srinivasan. Local earth mover's distance and face 



215 



warping. In IEEE International Conference on 

Multimedia and Expo, volume 2, pages 1227-1230 

Vol.2, June 2004. 
[19] H. Tan and C. Ngo. Localized matching using Earth 

Mover's Distance towards discovery of common 

patterns from small image samples. Image and Vision 

Computing, 27(10):1470-1483, 2009. 
[20] M. Wichtcrich, I. Assent, P. Kranen, and T. Seidl. 

Efficient emd-based similarity search in multimedia 

databases via flexible dimensionality reduction. In 

ACM SIC MOD, pages 199-212, 2008. 
[21] J. Xu, Z. Zhang, A. Tung, and G. Yu. Efficient and 

effective similarity search over probabilistic data based 

on earth mover's distance. Proceedings of the VLDB 

Endowment, 3 (1-2): 758-769, 2010. 
[22] B. Yi and C. Faloutsos. Fast time sequence indexing 

for arbitrary Ip norms. In VLDB, pages 385-394. 

Citeseer, 2000. 
[23] D. Zhou, J. Li, and H. Zha. A new Mallows distance 

based metric for comparing clusterings. In ICML, 

pages 1028-1035. ACM, 2005. 

APPENDIX 

A. NORMAL LOWER BOUND PROOF 

We present the proof to Theorem 2 that 

EMDlb{'S>p,<S>q) < EMD{projs^{P),projs,{Q)) 
Proof. Without loss of generality, we prove the case 
when $p -< $Q. We define EMDlb{'^p,'^q)' as 

EMDlb{^p, *q)' = ^ """" \^p{t) - *Q(t)| dt 

-[ Errp{t)dt+ [ ErrQ(t)dt 

Jtmin, J train 

From the definition of the minimum and maximum errors 
in Eqn. (4), we know that 

EMDlb{^p,^q) < EMDlb{^p,^q)' 

as EMDlb{^p, ^qY is the computation of the lower bound 
without prc-computing the intersection errors. 
Demonstrating that 

|$p(t) - $Q(t)l - Errp{t) + Errqit) < \Cp{t) - Cgit)\ Vt 

implies that EMDlb{^p, '^q)' is no more than the projec- 
tion EMD, in which case, wo know that EMDlb{^p,^q) 
is then a lower bound to the projection EMD. 

At some point t, if Cp{t) < Cq(t), then the contribution 
to the projected EMD at this point t is 

\Cp{t)-cs)\-cs)-cAt) 

= {ErrQ{t) + -I>Q(t)) - {Errp{t) + <ip{t)) 
= \^p{t) - *Q(t)| - Errpit) + Errqit) 

Hence the value computed at this point t is the same using 
the normals and the errors as the projected EMD. If Cp(t) > 
Cq{t) then the contribution to the EMD is 

\Cp{t)-C,{t)\=Cp{t)-Cq{t) 

= {Errp{t) + $p(f)) - {Errgit) + ^Q{t)) 
= (<l>p(t) - <&Q(i)) + {Errp(t) - ErrQ(t)) 
> \^p{t) - - Errp{t) + ErrQ{t) 



The last line is due to the fact that wc assume $p -< 
$(3, so in order for Cp{t) to be greater than Cq{t), Errp{t) 
must be greater than ErrQ{t) and the difference between 
the normals. 

Thus for every t, the normal EMD plus/minus the er- 
ror terms from the two distributions will always be at most 
the actual value contributed to the projected EMD. Hence, 
EMDLB{^p,<S>Qy < EMD{projs^{P),projs,{Q)), which 
means that the normal lower bound is also less than the 
projected bound. The proof for the case where -< ^p is 
the same with the signs reversed. When $p(i) and ^Q{t) 
intersect, we can break the proof for the theorem into two 
parts; before the intersection and after. □ 

B. PARTIAL DOMINANCE PROOF 

To prove Theorem 5, wc need to show that the normal 
EMD terms and the error terms in the bound are both less 
than any point in AI. From the definition of the bound in 
Eqn. (8), we know that the error terms are always less than 
any error term in M. We now just need to show that the 
minimum normal EMD between Q and M lies on the line 
between Mis and M„. That is, 

mm {EMDM{$mi,<^Q)} > 

^ [SMDaa(*m„ , $0) + EMDMi^Mi, , ^q) 
-SMDaa($m„,$m„)] 

Proof. We can easily demonstrate this via contradiction. 

Assume that the minimum EMD normal is a point W not 
on the line between Mu and Mis, that is, inside the bounding 
region. Let us denote the intersection between $q and 
as tis- The set of all normal distributions that intersect $q 
at tis lie along a line in dominance space that passes through 
both Q and W. This line can be found by simple arithmetic 
and we omit the details due to space considerations. 

There now exists on the line from W to Q, another normal 
W' with same normal intersection point, but with cr^,/ > aw 
Meaning, with a variance that is closer to ffp. As and W' 
also intersect at tis, the CDF difference between Q and W' 
must be less than Q and W because the intersection points 
are all the same but a^i is closer to ap than a™. Hence, 
on this line from ly to Q, we can keep finding normals that 
have a smaller distance to Q than W, until we reach the line 
between M„ and Mis. Thus, the minimum normal distance 
must lie on the line between M„ and Mis. □ 

C. SPACE COMPLEXITY ANALYSIS 

The projection of each database object is represented by a 
normal approximation and error terms. This requires stor- 
age of the mean and variance of each distribution on the 
projection, as well as the minimum and maximum errors for 
all s sub-intervals. Therefore, any index that implements 
the normal lower bound requires 2NP + 2NPs = O(NPs) 
space, where A'^ is the database cardinality and P is the rmm- 
ber of projections. In contrast, the TBI method requires 
2LA'^ = 0{LN) space, where L is the number of feasible 
solutions that are implemented by the index. As shown in 
Section 6.2, the number of projections and sub-intervals is 
generally very small, achieving increased performance over 
the TBI method with only a modest increase in the space 
requirements. 
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